Abstract

This paper examines algorithms for detecting when a property Φ holds
during the execution of a distributed system. The properties we
consider are expressed over the state of the system and are not assumed
to have properties that facilitate detection, such as stability.

Detection is done by a monitoring process within the system, which
cannot perceive an execution of a distributed system as a total order;
because of this, we consider two interpretations for "detecting Φ":

1. There is an execution consistent with the observed behavior such
that Φ was true at a point in that execution. We refer to this
property as possibly Φ.

2. For all executions consistent with the observed behavior, there
was some point in real time at which the global state of the
system satisfied Φ. We refer to this property as definitely Φ.

In this paper, we give formal definitions for these two
interpretations and present algorithms for them. We give protocols for both
asynchronous and synchronous systems and, for synchronous systems,
give upper bounds on the time between the occurrence of the property
of interest and the time a monitor detects the property.

1 Introduction


A reactive system [6] is characterized by a control program that interacts 
with an environment. The control program is input-driven: it monitors the 
environment and reacts to significant events by sending commands to the 
environment. There are many examples of reactive systems: for example,
most embedded real-time systems are reactive systems, in which case the
environment is an instrumented physical process. Non-real-time examples
of reactive systems include monitoring and debugging systems [4,13] and
tool integration services [5,14].

In the Meta project [9,11], we have been developing tools that support 
the management of distributed applications through the use of a reactive 
system structure. Using Meta, the distributed application and its supporting 
services (for example, operating system, network servers, and hardware) can 
be instrumented with sensors that access its state and actuators that allow 
its state to be changed. Meta also provides a distributed interpreter of 
finite state automata that reference these sensors and actuators. Under 
Meta, control programs are translated into finite state automata that are 
executed by this distributed interpreter. Each interpreter executes guarded 
atomic commands of the form (Φ → S), meaning execute the action S in a
state satisfying the global state predicate Φ.

The problem addressed in this paper arises in the context of Meta: how
can a set of processes monitor the state of a distributed application in a
consistent manner? For example, consider the simple distributed application 
shown in Figure 1. Each of the three processes in the application has a 
light, and the control processes would each like to take an action when 
some specified subset of the lights are on. The application processes are 
instrumented with stubs that determine when the process turns its light on 
or off. This information is disseminated to the control processes, each of 
which then determines when its condition of interest is met. 

Meta is built on top of the ISIS toolkit [1], and so we first built the
sensor dissemination mechanism using atomic broadcast. Atomic broadcast 
guarantees that all recipients receive the messages in the same order and 
that this order is consistent with causality [7]. Unfortunately, the control 
processes are somewhat limited in what they can deduce when they find 
that their condition of interest holds. 

Figure 1: A Monitored Distributed Application

For example, Figure 2 shows a space-time diagram of an execution of
the application shown in Figure 1. In this figure, a process turning its light
on is represented by a rectangle and the process turning its light off is
represented by a vertical line. Assume, for the moment, that this system is
asynchronous, meaning that there is no bound on message passing delays or
on the relative speeds of processes. In this case, the only ordering relations
between events that can be determined from within the system are those of
potential causality. Two events that are not so related are concurrent. In
Figure 2, the events a and b are concurrent, as are a and c, so the control
processes could receive these event notifications (as sent by atomic broadcast) in
one of these orders: (a;b;c), (b;a;c) or (b;c;a). Thus, the control processes
may or may not determine that p_1's and p_2's lights were on simultaneously,
but they will reach the same decision. On the other hand, the events a, d
and e are causally ordered, so the control processes will determine that p_1's
and p_3's lights were on simultaneously.

Given a global property Φ, there are at least two ways that "detecting Φ"
can be interpreted:

1. There is an execution (i.e., a linear sequence of events) consistent
with the observed behavior such that Φ was true at a point in that
execution. We will refer to this property as possibly Φ. In the
space-time diagram shown in Figure 2, the predicate possibly (p_1's light on
and p_2's light on) holds.

2. For all executions consistent with the observed behavior, there was
some point in real time at which the global state of the system satisfied
Φ. We will refer to this property as definitely Φ. In the execution
shown in Figure 2, the predicate definitely (p_1's light on and p_3's light
on) holds, since the event of p_3 turning its light on happened between
p_1 turning its light on and p_1 turning its light off.

Figure 2: Space-Time Diagram of Application Execution

Note that definitely Φ is stronger than possibly Φ. Hence, we will want
to guarantee that if a control program determines possibly Φ for a set of
local states, then no control program will ever determine definitely ¬Φ for
the same states. Note that both of these conditions refer to some past state
or states.

In this paper, we give formal definitions for these two interpretations 
and present protocols for them. We first give the protocols for an asyn- 
chronous system. These protocols can take an unbounded amount of time 
to detect their condition of interest; furthermore, they can have
substantial running times because they may have to enumerate many possible global
states. However, no better is possible, in general, due to the nature of
asynchronous systems. We then modify these protocols for a system with 
approximately synchronized clocks and bounded message delay. These pro- 
tocols are more practical, and we give upper bounds on the time between 
the occurrence of the property of interest and the time a control program 
detects the property. The existence of such a bound makes these protocols 
more useful in real systems. 

Snapshot protocols for computing global states of a distributed system [2] 
are related to the protocols described in this paper, but they suffer from a 
limitation similar to that of the atomic broadcast implementation described
above. In particular, if S is the global state computed by the snapshot,
then there exists a legal execution of the system containing S, and so Φ(S)
implies possibly Φ. Snapshot protocols are well-suited for detecting stable
properties, which are those that, once they become true, remain so. It
may be the case that possibly Φ holds of an execution, but the snapshot
protocol never detects it (this can happen if Φ is not stable).

A recent dissertation by Spezialetti [16] looks at a broader set of issues 
than those covered in this paper, such as using semantic information (like 
relative stability) to determine which local events could make a global prop- 
erty true. This dissertation also presents protocols whose specification is 
similar to ours. However, her protocols that detect event occurrence
suffer from the same limitation as snapshots and the atomic-broadcast-based
protocol described above. Additionally, Spezialetti's protocols do not take
into account the ordering of events established by the underlying system's
communication. We have also looked at the problem addressed in this paper
when environments are continuous state transition systems [8]. Such
systems have the useful property that physical variables can, in many cases, be
interpolated forward. By doing so, the monitor can reason about the current
state of the physical process rather than a past state, and so possibly Φ and
definitely Φ can be determined for the current state.

2 Definitions 

We first define the notion of an execution of a system. A system is composed
of processes, some of which are part of the application being run and some
of which are part of the monitoring control program. Let {p_1, ..., p_n} be
the set of application processes; for the sake of simplicity, we assume that
there is only one monitoring process, denoted p_0. Each pair of processes is
connected by a point-to-point, reliable, FIFO communication link, and we
assume that processes do not exhibit faulty behavior.

Each process p_i has a local state s_i, which changes when an event occurs
at the process. An event may be completely internal to the process, or it
may be the sending or receipt of a message (e.g., "send m_1 to p_j" or "receive
m_2 from p_k"). For the sake of simplicity, we assume that all messages sent in
the system are unique. Process p_i's local history, denoted h_i, is a (possibly
infinite) sequence of states and events

\[ h_i = s_i^0 \, e_i^1 \, s_i^1 \, e_i^2 \, s_i^2 \cdots \]

In this case, s_i^0 is p_i's initial state, and the first event it executes is e_i^1,
after which the process's state is s_i^1, etc. A global state is a tuple S =
(s_0, s_1, ..., s_n) of local states, one for each process. Although the monitor, p_0, is a process
in the system, when we refer to a global state, we will usually mean only the
global state of the application, (s_1, ..., s_n). A global history (or history) is
a tuple H = (h_0, h_1, ..., h_n) of local histories, one per process.

Although a global history does not specify the relative timings of events
and states at different processes, it does allow us to draw certain conclusions
about these timings. An event e_i^l happens before e_j^m (written e_i^l → e_j^m) if
one of the following is true [7]:

• the events are at the same process and occur in the order indicated,
that is, if i = j and l < m;

• e_i^l is the sending of a message by p_i to p_j and e_j^m is the receipt of that
message; or

• there is another event e_k^q such that e_i^l → e_k^q and e_k^q → e_j^m.
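
As an illustration only, the three rules can be sketched in Python as a transitive closure over a finite history. The encoding is an assumption of this sketch and not part of the paper: processes are numbered from 0, the k-th event of process i is named (i, k), and each message is given as a (send event, receive event) pair.

```python
from itertools import product


def happens_before(num_events, messages):
    """Return the set of pairs (e, f) such that e happens before f.

    num_events[i] is the number of events executed by process i; the k-th
    event of process i is named (i, k), with k starting at 1.
    messages is an iterable of (send_event, receive_event) pairs.
    Both encodings are assumptions made for this sketch.
    """
    events = [(i, k) for i, count in enumerate(num_events)
              for k in range(1, count + 1)]
    relation = set()
    # Rule 1: events at the same process, in local order.
    for (i, k), (j, m) in product(events, repeat=2):
        if i == j and k < m:
            relation.add(((i, k), (j, m)))
    # Rule 2: the sending of a message precedes its receipt.
    relation.update(messages)
    # Rule 3: transitivity, applied until no new pair appears.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(relation), repeat=2):
            if b == c and (a, d) not in relation:
                relation.add((a, d))
                changed = True
    return relation


def concurrent(relation, e, f):
    """Two distinct events are concurrent if neither happens before the other."""
    return e != f and (e, f) not in relation and (f, e) not in relation
```

The closure is deliberately naive (it rescans all pairs on every pass), which is adequate for small illustrative histories but not for long executions.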

The “happens before” relation can be used to reason about the possible 
executions associated with a global history. We associate with each global 
history a set of linearizations. 

A linearization L of a history H is a sequence of global states and local
events

\[ S^0 \, e^1 \, S^1 \, e^2 \, S^2 \cdots \]

that contains exactly the events in H such that, if e^m → e^n in H, then m <
n. Notice that no prefix of L contains the receipt of a message whose sending
does not appear in that prefix. In synchronous systems (see Section 3.3),
there are further constraints on the linearizations of a global history.

(The above definition of linearization assumes that, in the actual ex- 
ecution of a distributed system, no two events can occur simultaneously. 
This need not be the case; it is possible that events at different processors 
may occur at exactly the same time. We can easily extend our definition of
linearization to accommodate such simultaneous events.)

We use the notion of a cut to represent the global states that could have
occurred in the execution. A cut [2] of a global history H is a tuple of
natural numbers (t_1, ..., t_n) that represents the state of the system after t_i
events have executed at process p_i; that is, the cut represents the global state
(s_1^{t_1}, ..., s_n^{t_n}). Only certain cuts of a global history can truly correspond to
global states that took place at some real time. A cut (t_1, ..., t_n) of H is
consistent if there is some point in some linearization L of H by which each
process p_i executed exactly t_i events. L is said to pass through this cut. We
will also refer to the associated tuple of events, (e_1^{t_1}, ..., e_n^{t_n}), as a consistent
cut.
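
The definition can be made concrete with a small Python sketch that collects every consistent cut of a finite asynchronous history by growing prefixes of linearizations one event at a time. The encoding is hypothetical: processes are numbered from 0, events are named (i, k) with k starting at 1, and `sender_of` maps each receive event to its matching send event.

```python
def consistent_cuts(num_events, sender_of):
    """Return the set of consistent cuts of a finite history.

    num_events[i] is the number of events at process i. sender_of maps a
    receive event (j, m) to the send event (i, k) of the same message.
    A cut is a tuple giving how many events each process has executed.
    """
    n = len(num_events)
    found = set()

    def extend(cut):
        if cut in found:
            return  # already explored from this cut
        found.add(cut)
        for i in range(n):
            if cut[i] == num_events[i]:
                continue  # process i has no further events
            event = (i, cut[i] + 1)
            send = sender_of.get(event)
            # A receive may be appended only after its matching send.
            if send is None or cut[send[0]] >= send[1]:
                extend(cut[:i] + (cut[i] + 1,) + cut[i + 1:])

    extend((0,) * n)
    return found
```

Every cut reached this way is a point on some linearization, because each extension step appends one event whose causal predecessors are already included.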

We want to be able to reason about certain facts (such as "possibly
Φ") being true in different global histories. To this end, we introduce the
following notation. Let H be some global history of the system. To formally
define "possibly Φ", we introduce the formulas ?Φ|C, where C is a consistent
cut. Formally, ?Φ|C holds for history H if C = (t_1, ..., t_n) is a consistent
cut of H and Φ holds for the global state (s_1^{t_1}, ..., s_n^{t_n}). If ?Φ|C holds for H,
then it is possible that Φ held during the execution that generated H since
it held at some point in some linearization of H.

To formally define "definitely Φ", we introduce the formulas !Φ|A, where
A is a finite set of cuts. Formally, !Φ|A holds for H if A is a finite set of
consistent cuts of H, every linearization of H passes through some cut in A,
and for all cuts (t_1, ..., t_n) ∈ A, Φ holds for the global state (s_1^{t_1}, ..., s_n^{t_n}).
If !Φ|A holds for H, then Φ definitely held at some point in the execution
that generated H because it held at some point in all linearizations of H.

Note that the definitions of these formulas satisfy two properties
discussed earlier. The definitely operator ! is clearly stronger than the possibly
operator ? in the following sense: if !Φ|A holds for H, then for any C ∈ A,
?Φ|C holds for H. Furthermore, the two operators are, in a certain sense,
dual. If !¬Φ|A holds for H, then ?Φ|C cannot hold for H if C ∈ A.

Informally, the control process p_0 detects possibly Φ when it can
determine that there is a consistent cut of H that satisfies Φ, and p_0 detects
definitely Φ when it can find a finite set of consistent cuts A such that every
linearization of H passes through a member of A and such that Φ holds
for every member of A. We are investigating a more formal definition of
detection, but we do not present such definitions in this version of the paper.

3 Protocols 

As noted above, the system consists of n+1 processes {p_0, p_1, ..., p_n} whose only
method of interaction is by exchanging messages. The process p_0 monitors
the other processes to determine when some state predicate becomes true.
This state predicate of interest will be of the form possibly Φ or definitely
Φ, where Φ is a predicate over the states of the processes p_1, ..., p_n.

Each process p_i will know how to compute Φ and will send a message
to p_0 when its local state changes in a way significant with respect to Φ. In
particular, a process can determine whether a local event potentially changes
Φ. More formally, let Φ be a predicate expressed over a global state; that
is, Φ(s_1, ..., s_n) is true or false. Consider some event e_i^k of process p_i; recall
that s_i^{k-1} is the value of s_i before e_i^k executes and s_i^k is the value of s_i after
e_i^k executes. Event e_i^k potentially affirms Φ if the execution of e_i^k could have
made Φ true:

\[ \exists s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n :
   \neg\Phi(s_1, \ldots, s_i^{k-1}, \ldots, s_n) \wedge \Phi(s_1, \ldots, s_i^{k}, \ldots, s_n). \]

Similarly, event e_i^k potentially rejects Φ if the execution of e_i^k could have made
Φ false:

\[ \exists s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n :
   \Phi(s_1, \ldots, s_i^{k-1}, \ldots, s_n) \wedge \neg\Phi(s_1, \ldots, s_i^{k}, \ldots, s_n). \]

An event potentially changes Φ if it potentially affirms or rejects Φ; such an
event is also called a relevant event.

Note that an event can both potentially affirm and reject Φ. For example,
if n ≥ 4 and Φ is "either two or three processes have their lights on," then
when a process turns its light on, this action both potentially affirms and
rejects Φ even though it is possible that the value of Φ did not change.
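
When local states range over a small finite domain, the two existential conditions can be checked by brute force. The following Python sketch does so for the lights example; the function names, the 0-based process index i, and the encoding of a light as 0 (off) or 1 (on) are assumptions of this sketch.

```python
from itertools import product


def _with_state(others, i, s):
    """Insert local state s at position i among the other n-1 local states."""
    return others[:i] + (s,) + others[i:]


def potentially_affirms(phi, n, i, before, after, domain):
    """Some choice of the other states makes phi false before and true after."""
    return any(
        not phi(_with_state(others, i, before))
        and phi(_with_state(others, i, after))
        for others in product(domain, repeat=n - 1)
    )


def potentially_rejects(phi, n, i, before, after, domain):
    """Some choice of the other states makes phi true before and false after."""
    return any(
        phi(_with_state(others, i, before))
        and not phi(_with_state(others, i, after))
        for others in product(domain, repeat=n - 1)
    )


def two_or_three_on(state):
    """Phi for the example: either two or three processes have their lights on."""
    return sum(state) in (2, 3)
```

With four processes, turning a light on (0 to 1) satisfies both conditions: it affirms when exactly one other light is on and rejects when three others are on. With only three processes the rejecting case cannot arise.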

Our detection protocols will have each monitored process periodically
send to the monitor its state relevant to Φ; that is, the message will contain
the values of the variables of p_i referred to in Φ. For each process p_i (1 ≤ i ≤
n), process p_0 maintains a sequence Q_i of such messages received from p_i.
These messages will also carry information for ordering these states, which
is described next.


3.1 Weak Vector Clocks and Enumeration of Global States 

Our protocols will have the monitor enumerate possible global states of the
system by choosing states from each of the message sequences Q_i. In this
section, we describe how this enumeration of global states is performed. We
use a slight modification of vector clocks [12].

A logical clock [7] is a value T that satisfies the clock condition: given
two events e_1 and e_2 and their associated clock values T(e_1) and T(e_2), if
e_1 → e_2, then T(e_1) < T(e_2). We will find it advantageous to use clocks that
also satisfy the converse of the clock condition; that is, clocks that satisfy

\[ (e_1 \rightarrow e_2) \Leftrightarrow T(e_1) < T(e_2). \tag{1} \]

In particular, such clocks enable one to determine whether or not two events
are concurrent; e_1 and e_2 are concurrent if neither e_1 → e_2 nor e_2 → e_1.
Unfortunately, Lamport's logical clocks of [7] (which are implemented using
a single counter) do not satisfy Equation 1.

A logical clock that satisfies Equation 1 can be implemented with a
vector V of n counters. If V_i is the logical clock associated with process p_i,
then V_i[i] is the number of events that have been executed by p_i and V_i[j],
j ≠ i, is the number of events that p_i "knows" p_j has executed. If e_i is
an event at process p_i, then we use V(e_i) to denote the value of V_i after e_i
executed. Given this definition, one can easily show that e_i → e_j if and only
if the vector clock of e_j records the fact that e_i has occurred:

\[ e_i \rightarrow e_j \Leftrightarrow V(e_i)[i] \le V(e_j)[i]. \tag{2} \]

Similarly, if e_i and e_j are concurrent, then

\[ V(e_i)[i] > V(e_j)[i] \wedge V(e_j)[j] > V(e_i)[j]. \]

If the set of processes is static, then vector clocks are not hard to
implement. Initially, V_i[j] is set to zero for all i and j. V_i[i] is incremented
whenever p_i executes an event. Every message sent by p_i is timestamped
with V_i (let V(m) refer to the timestamp on message m). If e_i is the receipt
of message m, then each V_i[k] (k ≠ i) is set to the maximum of V_i[k] and
V(m)[k]. As an example, Figure 3 shows the values of vector clocks for the
events of the execution shown in Figure 2.
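
The update rules are short enough to sketch directly. The Python class below is a minimal illustration under stated assumptions, not the Meta implementation: process identifiers are 0-based, the class and method names are invented for the sketch, and a message is represented only by the timestamp it carries.

```python
class Process:
    """One process with a vector clock of n counters (0-based process ids)."""

    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n  # initially zero for all entries

    def local_event(self):
        """Execute an internal event and return its vector timestamp."""
        self.clock[self.pid] += 1
        return tuple(self.clock)

    def send(self):
        """Sending is itself an event; the message carries the resulting clock."""
        return self.local_event()

    def receive(self, timestamp):
        """Receiving is an event, so the local counter advances as well."""
        self.clock[self.pid] += 1
        for k, value in enumerate(timestamp):
            if k != self.pid:
                self.clock[k] = max(self.clock[k], value)
        return tuple(self.clock)


def happened_before(i, v_i, v_j):
    """Equation 2 for an event at process i and an event at another process."""
    return v_i[i] <= v_j[i]
```

For two events at different processes, applying `happened_before` in both directions and finding both results false indicates that the events are concurrent.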

We can use vector clocks to determine whether or not a set of local
states represents a consistent cut. The set of local states S = (s_1, ..., s_n) is
a (consistent) global state if every pair of local states s_i and s_j is potentially
concurrent. In terms of vector clocks, s_i and s_j are potentially concurrent
if V(s_i)[i] ≥ V(s_j)[i] and V(s_j)[j] ≥ V(s_i)[j]. Thus, the global state S is a
consistent cut if and only if

\[ \forall i, j : 1 \le i, j \le n : V(s_i)[i] \ge V(s_j)[i]. \tag{3} \]
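
Equation 3 translates into a single quantified test. In the Python sketch below, the function name is hypothetical, `timestamps[i]` stands for V(s_i), and indices are 0-based rather than 1-based.

```python
def is_consistent_cut(timestamps):
    """Equation 3: no process knows of more events at p_i than p_i has executed.

    timestamps[i] is the vector time V(s_i) of the local state chosen
    for process i (0-based indices, an assumption of this sketch).
    """
    n = len(timestamps)
    return all(
        timestamps[i][i] >= timestamps[j][i]
        for i in range(n)
        for j in range(n)
    )
```

For instance, the pair of timestamps (1,0) and (2,1) fails the test, because the second process already reflects two events of the first while the first has executed only one.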

Because we are interested only in the causal relationship of events that
potentially change Φ, we can use a slight weakening of vector clocks [10].
With our clocks, process p_i will increment its local counter V_i[i] only when
it executes an event that potentially changes Φ. It will send a message
to p_0 whenever its vector clock changes; that is, either when it executes a
relevant event or when it executes a receive event through which it learns
that another process has potentially changed Φ. The message sent from p_i
to p_0 will contain p_i's state s_i after such an event is executed and the vector
time V(s_i).

Figure 3: Vector Clocks for Events of Figure 2

Figure 4 illustrates such vector clocks. These clocks are a weakened
version of normal vector clocks; for example, if i = j, they need not satisfy
Equation 2. They do, however, satisfy Equation 3, and this is all that
our protocols need. For the sake of simplicity, the remainder of this paper
assumes that all events (including send and receive events) are relevant
events and thus that our weak vector clocks are true vector clocks.

3.2 Asynchronous Systems 

In this section, we assume that processes do not possess local real-time 
clocks, that there is no global clock, and that there is no upper bound on 
message delays. We note in advance that there is no way to bound the 
amount of time between the time a condition becomes true and the time the 
monitor detects the condition. This is because messages sent to the monitor 
may be arbitrarily delayed. 

The protocols for detecting possibly Φ and definitely Φ are based on the
same data structure: the lattice of consistent global states that correspond
to an observed execution. Such a lattice consists of n orthogonal axes, with
one axis for each monitored process. A point t = (t_1, t_2, ..., t_n) in this lattice
corresponds to a consistent global state in which process p_i has executed t_i
events. Of course, not all tuples (t_1, ..., t_n) appear in the lattice; this
depends on the causal dependencies among the local states of the processes. Define the
level of a point t to be the sum of its indices t_1 + t_2 + ... + t_n.

Figure 4: Weak Vector Clocks for Events of Figure 2

Consider some global history. A linearization of this history is a total
order of (consistent) global states in which exactly one process executes one
event between adjacent global states. In terms of the lattice corresponding
to the history, a linearization corresponds to a path in the lattice, where the
level of each subsequent point in the path increases by one. A space-time
diagram of a two-process system and the corresponding lattice of global
states is illustrated in Figure 5. A point S_ij represents a state in which
process p_1 is in its i-th state and process p_2 is in its j-th state. From the lattice,
it is easy to see that one possible execution corresponds to the sequence of
global states

\[ S_{00};\ S_{01};\ S_{11};\ S_{21};\ S_{22};\ S_{23};\ S_{33};\ S_{43} \]

For every point t in a lattice, there exists at least one linearization that
passes through t. Hence, if any point in the lattice satisfies Φ, then possibly
Φ holds. The property definitely Φ requires all linearizations to pass through
a point that satisfies Φ. For example, suppose in Figure 5 that the points S_43
and S_34 both represent states that satisfy Φ; then definitely Φ holds. This is
because S_43 and S_34 are the only points in level 7 and all linearizations must
pass through some point in that level. Definitely Φ also holds if instead the
states represented by points in the set {S_53, S_35, S_54, S_45} all satisfy Φ. This
is because if a linearization does not pass through S_53 or S_35, then it must
pass through S_44 and hence through either S_54 or S_45.

Figures 6 and 7 give the high-level algorithms that a monitoring process
uses to detect possibly Φ and definitely Φ, respectively, in an n-processor
system. Each algorithm begins by having the monitor distribute the
predicate Φ to all processes and then construct the initial global state of level 0.
(It is assumed that the monitor knows a priori each process's initial state
relevant to Φ; if this is not the case, the processes begin by performing a
two-phase synchronization protocol.)

The possibly Φ algorithm is straightforward: using the messages it
receives, the monitor iteratively constructs levels of the lattice, using the
vector timestamps accompanying each message (see below). If it ever finds a
global state in the current level satisfying Φ, then it reports possibly Φ and
halts. Note that this protocol is not optimal in its reporting time because
it always waits for a level to be completely enumerated. This restriction is
not necessary and is only done to simplify the presentation of the algorithm
and the one that follows.

The definitely Φ algorithm also iteratively constructs one level at a time.
It attempts to prove that all paths in the lattice pass through a state
supporting Φ. To this end, when constructing a new level, it adds only states
that do not support Φ; call the resulting level a reduced level. If the monitor
ever finds an empty reduced level, then the monitor halts and reports
definitely Φ (in fact, it can report that Φ definitely holds by the time processes
execute a total of lvl relevant events, where lvl is the last level enumerated).
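
The following Python sketch illustrates both level-by-level searches. It is an offline simplification and not the algorithms of Figures 6 and 7: it assumes the complete, finite queues are already available to the monitor, that all events are relevant, and that `queues[i][t]` holds a hypothetical pair (state, vector timestamp) for process i after t events, with index 0 being the initial state. Because the history is finite, `definitely` also has to stop with a negative answer when a path avoiding Φ reaches the final global state.

```python
def global_state(queues, cut):
    """Local states selected by a cut, one per process."""
    return tuple(queues[i][t][0] for i, t in enumerate(cut))


def is_consistent(queues, cut):
    """Equation 3 applied to the timestamps selected by the cut."""
    stamps = [queues[i][t][1] for i, t in enumerate(cut)]
    n = len(cut)
    return all(stamps[i][i] >= stamps[j][i] for i in range(n) for j in range(n))


def next_level(queues, level):
    """All consistent cuts reachable by one more event from some cut in level."""
    successors = set()
    for cut in level:
        for i, t in enumerate(cut):
            if t + 1 < len(queues[i]):
                candidate = cut[:i] + (t + 1,) + cut[i + 1:]
                if is_consistent(queues, candidate):
                    successors.add(candidate)
    return successors


def possibly(queues, phi):
    """Return a consistent cut satisfying phi, or None if there is none."""
    level = {(0,) * len(queues)}
    while level:
        for cut in sorted(level):
            if phi(global_state(queues, cut)):
                return cut
        level = next_level(queues, level)
    return None


def definitely(queues, phi):
    """True if every path through the recorded lattice meets a phi state."""
    final = tuple(len(q) - 1 for q in queues)
    level = {(0,) * len(queues)}
    while True:
        # Reduced level: keep only the cuts that do not support phi.
        level = {c for c in level if not phi(global_state(queues, c))}
        if not level:
            return True
        if final in level:
            return False  # some linearization avoided phi throughout
        level = next_level(queues, level)
```

A set is used for each level so that a cut reachable from several predecessors is stored and tested only once.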

As stated earlier, the implementations of the algorithms in Figures 6 
and 7 require a monitored process to send the relevant part of its local state 
to the monitor whenever its vector clock changes. The monitor maintains 
sequences of these states, one per process, and assembles them into the 
necessary global states. Thus, the monitor must be able to determine when 
it can assemble all the reachable global states of a given level and when 
it can drop a local state from its sequence because the local state cannot 
appear in any further global states of interest. To achieve this, we use weak 
vector timestamps developed in the previous section. 

Let Q_i be the sequence of messages that p_0 has received from p_i, stored in
FIFO order. Each state s_i in a message stored in Q_i is labeled with the weak
vector timestamp V(s_i) of the event that generated that state. Equation 3
defines when a set of states (s_1, s_2, ..., s_n), with s_i from process p_i, comprises
a consistent cut. Note that the level of this global state is Σ_{u=1}^{n} V(s_u)[u].

Consider some point t = (t_1, ..., t_n) in the lattice that corresponds to
the state (s_1, ..., s_n). The monitor can enumerate points of the next level
in the lattice as follows. For each process p_i, the monitor checks to see if s_i',
the state in the (t_i + 1)-st message of Q_i, is potentially concurrent with the
other s_j's (if there is no such state in Q_i, the monitor cannot complete the
next level until it receives that state). Thus, if

\[ \forall j : j \ne i : V(s_i')[i] \ge V(s_j)[i] \wedge V(s_j)[j] \ge V(s_i')[j] \tag{4} \]

then point (t_1, ..., t_i + 1, ..., t_n) is in the lattice at the next level. (Although
many such states will have to be checked, it should be clear that a state at some
level in the lattice may follow from several in the previous level: it only has
to be checked once and not for each possible predecessor.)
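
Equation 4 only compares the new state against the other members of the current cut, which are already known to be mutually consistent. A minimal Python sketch of that incremental test follows; the function name and 0-based indices are assumptions.

```python
def can_advance(cut_stamps, i, new_stamp):
    """Equation 4: may process i move on to the state stamped new_stamp?

    cut_stamps[j] is V(s_j) for the current consistent cut, and new_stamp
    is V(s_i') for the next state in the queue of process i.
    """
    return all(
        new_stamp[i] >= cut_stamps[j][i] and cut_stamps[j][j] >= new_stamp[j]
        for j in range(len(cut_stamps))
        if j != i
    )
```

The check costs one pass over the n - 1 other processes, rather than the n × n comparisons needed to re-verify Equation 3 for the whole candidate cut.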

We can also use vector timestamps to determine when a message
containing state s_i can be eliminated, in the interest of saving space, from a
queue Q_i. If the last state in each other queue happens after s_i and is not
potentially concurrent with it, then no state subsequently received could
possibly form a global state with s_i. Thus, the message containing s_i can
be removed from Q_i as soon as the following holds:

\[ \forall j : j \ne i : V(Q_j.\mathit{last})[i] > V(s_i)[i], \]

where Q_j.last is the last state in Q_j.
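
The removal condition can be sketched as follows. In this Python illustration the function name is hypothetical, `queue_stamps[j]` is the list of vector timestamps currently held in Q_j in FIFO order, and an empty queue is treated conservatively as giving no evidence that the state can be dropped.

```python
def can_discard(queue_stamps, i, stamp):
    """True if the state of process i with timestamp `stamp` can be dropped.

    The state is obsolete once the last state in every other queue already
    reflects a later event of process i (0-based indices).
    """
    return all(
        len(q) > 0 and q[-1][i] > stamp[i]
        for j, q in enumerate(queue_stamps)
        if j != i
    )
```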

The running time of both detection algorithms is linear in the number of
global states. Unfortunately, the number of global states can be exponential 
in the number of processes. Even worse, the worst-case space complexity 
is unbounded, since the delivery of a message can be indefinitely delayed 
in an asynchronous system. While there are heuristics that can be used to 
limit the number of constructed global states, they are intrusive in that they 
require some kind of synchronization or limited blocking of the monitored 
processes. Real-time bounds on communication and the rate of change of 
local states can also be used, as is discussed in the next section. 

3.3 Partially Synchronous Systems 

In this section, we assume that each process p_i has a real-time clock C_i, and
that these clocks are approximately synchronized: at any given "real" time,
the difference between the clocks of two processes is no more than ε. We
define this formally by modifying our definition of histories and linearizations
slightly. First, all processes (including p_0) execute "tick" events: a process's
local time is the number of tick events that it has executed. If e_i is an event
at p_i, then C_i(e_i) is the number of tick events that p_i has executed through
e_i. If H is a history with approximately synchronized clocks, then L is a
linearization of H only if, in addition to the usual requirement, in all prefixes
of L and for every pair of processes p_i and p_j, the difference in the number of
tick events executed by the two processes is at most ε.

In addition to approximately synchronized clocks, we assume that there
are lower and upper bounds on message transmission times. This means that
if process p_i executes "send m to p_j" after it has executed t_s tick events, then
when p_j executes "receive m from p_i," it has executed t_r tick events, where
t_s + d_min ≤ t_r ≤ t_s + d_max for constants d_min and d_max (both greater than
0). These bounds will be especially important when considering messages
received by the monitor p_0. Approximately synchronized clocks can be used
to extend the "happens before" relation to order two events e_i and e_j even
when there is no explicit communication between p_i and p_j; thus, we redefine
e_i → e_j (see footnote 1):

\[ e_i \rightarrow e_j \Leftrightarrow (V(e_i)[i] \le V(e_j)[i]) \vee (C(e_i) + \epsilon < C(e_j)). \]

That is, e_i must happen before e_j either if e_i can causally affect e_j (as
measured by vector timestamps) or if the clock times corresponding to the
events show that e_i must happen first.

Our protocols will be such that each state s_i sent by a process p_i to the
monitoring process p_0 will include the local time C(s_i) at which the event
resulting in s_i occurred, as well as the vector timestamp V(s_i). The monitor
can then use the vector timestamps and the clock times of these states to
enumerate the levels of the lattice. The clock times can be used to further
restrict the pairs of events that are potentially concurrent. With each state
s_i in Q_i, the monitor can determine the latest local time at which p_i must
have been in state s_i (call this L(s_i)). If there is a state s_i' after s_i in Q_i,
then this is C(s_i'); if s_i = Q_i.last, then this is C - d_max, where C is the
monitor's current local time. If p_i had changed its local state between C(s_i)
and C - d_max, then the monitor would have gotten another message from p_i
by its local time C.

1 There is no need to take the transitive closure of the two relations because, if d_min > 0,
then V(e_i)[i] ≤ V(e_j)[i] and C(e_j) + ε < C(e_k) imply C(e_i) + ε < C(e_k), and C(e_i) + ε < C(e_j)
and V(e_j)[j] ≤ V(e_k)[j] imply C(e_i) + ε < C(e_k).

We can now say that two states s_i and s_j received by the monitor are
potentially concurrent if both the vector timestamps and the real-time
clocks indicate this:

\[
\begin{aligned}
& (V(s_i)[i] \ge V(s_j)[i]) \wedge (V(s_j)[j] \ge V(s_i)[j]) \\
& \wedge \bigl( (C(s_i) - \epsilon) \le C(s_j) \le L(s_i) + \epsilon \\
& \quad \vee\ (C(s_j) - \epsilon) \le C(s_i) \le L(s_j) + \epsilon \bigr).
\end{aligned}
\tag{5}
\]

Suppose now that the monitor is seeking to extend the state $(s_1, \ldots, s_n)$ to
the next level by potentially adding a new state $s_i'$, the $(t_i + 1)$st state in $Q_i$.
It checks to see if $s_i'$ is potentially concurrent with the other $s_j$'s by using
Equation 5 instead of Equation 4. If $s_i'$ is potentially concurrent with all
the $s_j$'s, then the state $(s_1, \ldots, s_i', \ldots, s_n)$ is added to the next level of the
lattice; otherwise, it is not. An exception to the last point is if $s_i' = Q_i.\mathit{last}$
and $s_i'$ was not deemed to be potentially concurrent because its latest time
was too early. For example, suppose $\epsilon = 1$ and

$C(s_i') = 3$; $L(s_i') = 4$; $C(s_j) = 6$; $L(s_j) = 7$.

Because $s_i' = Q_i.\mathit{last}$, $L(s_i') = C - d_{\max}$; as time passes on the monitor's
clock, $L(s_i')$ may grow so that the two states would be judged potentially
concurrent. In such cases, therefore, the decision about whether to add
the state $(s_1, \ldots, s_i', \ldots, s_n)$ to the lattice is postponed until either another
message arrives from $p_i$ or the monitor's clock advances to a point where a
decision can be made. Until then, the level cannot be completely enumerated.
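The postponement rule can be illustrated with a small three-way decision. This sketch covers only the clock part of the test and assumes the vector-timestamp part already holds; it also treats the other state's $L$ value as fixed. The names `Decision`, `clocks_overlap` and `decide` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ADD = "add"
    REJECT = "reject"
    POSTPONE = "postpone"


def clocks_overlap(c_a: float, l_a: float,
                   c_b: float, l_b: float, eps: float) -> bool:
    """Clock part of the potentially-concurrent test."""
    return (c_a - eps < c_b < l_a + eps) or (c_b - eps < c_a < l_b + eps)


def decide(c_new: float, l_new: float, new_is_last: bool,
           c_other: float, l_other: float, eps: float) -> Decision:
    """Decide whether the candidate state may join the next level."""
    if clocks_overlap(c_new, l_new, c_other, l_other, eps):
        return Decision.ADD
    # If the candidate is the last state in its queue, its L value is
    # C - d_max and grows with the monitor's clock. Postpone when a
    # sufficiently large L would make the test succeed.
    if new_is_last and clocks_overlap(c_new, float("inf"),
                                      c_other, l_other, eps):
        return Decision.POSTPONE
    return Decision.REJECT
```

With the values from the example (clock precision 1, candidate with $C = 3$ and $L = 4$, other state with $C = 6$ and $L = 7$), `decide(3, 4, True, 6, 7, 1)` returns `Decision.POSTPONE`; once the candidate's $L$ exceeds 5 the same call returns `Decision.ADD`.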

The conditions possibly $\Phi$ and definitely $\Phi$ can now be detected exactly
as in the previous section. Each processor sends its state to the monitor
whenever its vector clock changes; it includes with this message its vector
time and the number of tick events it has executed. The monitor then uses
this information to construct levels of the lattice, using the properties of the
"potentially concurrent" states discussed above. It then reports "possibly $\Phi$"
or "definitely $\Phi$" exactly as it would in the case of asynchronous systems.

We now argue upper bounds on detection times. Suppose that
$S = (s_1, \ldots, s_n)$ is a global state such that the last event leading to this state
occurs when the monitor's local time is $t$. No process's local clock is higher
than $t + \epsilon$ when one of the events leading to $S$ occurs, so $p_0$ receives all
messages necessary to construct this state by local time $t + \epsilon + d_{\max}$. Local
time $t + 2\epsilon$ is the latest that a process could execute an event that could be
potentially concurrent with one leading to $S$; thus, by time $t + 2\epsilon + d_{\max}$, $p_0$
will have completed the construction of the level containing $S$.
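The two bounds in this argument can be written side by side. The labels $T_{\text{msgs}}$ and $T_{\text{level}}$ are introduced here only for illustration, and the numeric values are an arbitrary instance rather than figures from the paper.

```latex
\begin{align*}
T_{\text{msgs}}  &\le t + \epsilon + d_{\max}
  && \text{(all messages needed to construct } S \text{ have arrived)} \\
T_{\text{level}} &\le t + 2\epsilon + d_{\max}
  && \text{(the level containing } S \text{ is fully enumerated)} \\[4pt]
\text{e.g. } t = 100,\ \epsilon = 1,\ d_{\max} = 5:\quad
  & T_{\text{msgs}} \le 106, \quad T_{\text{level}} \le 107 .
\end{align*}
```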

Suppose that possibly $\Phi$ holds of a history; this means that some consistent
cut of the history supports $\Phi$. If the last event leading to this cut occurs
when the monitor's local clock is $t$, then the monitor will finish enumerating
the level of this cut at its local time $t + 2\epsilon + d_{\max}$, detecting possibly $\Phi$ at that time
(actually, it could detect it at time $t + \epsilon + d_{\max}$ because, as noted earlier, the
possibly protocol does not need to enumerate an entire level once it finds a
state satisfying $\Phi$).

Suppose that definitely $\Phi$ holds of a history; this means that there is some
finite set of consistent cuts, all supporting $\Phi$, through which all linearizations
pass. If the last event leading to the last of these occurs when the monitor's
local time is $t$, then the monitor must detect definitely $\Phi$ by time $t + 2\epsilon + d_{\max}$
on its local clock; this is because the last state in the level of the last cut
will be enumerated by that time, and the protocol will halt.

The above discussion does not consider the amount of local computation
required by the monitor. In general, this depends on the relation between $\epsilon$
and the rate at which processes can potentially change $\Phi$. If clocks are closely
synchronized, then the monitor will never have to consider more than a few
state changes by any one process. If instead processes potentially change $\Phi$
very often, then the monitor may have to do significant local computation.

4 Conclusions 

This paper has defined two means (possibly and definitely) by which global
states in an asynchronous or loosely synchronous system can be detected. 
It presented algorithms by which a monitor can detect these properties in 
both types of systems. There are other means of detection that are also of 
interest. For example, we have been investigating a third type of detection, 
called currently, that occurs when the monitor learns a condition actually
holds at the time of detection. One can modify our definitely algorithms 
for partially synchronous systems to detect currently by requiring that ap- 
plication processes forgo potentially rejecting the condition being detected 
for a well-defined amount of time. We can obtain currently algorithms for 
asynchronous systems only by forcing application processes to block. 

These algorithms may be complex, both in terms of computation and 
storage. Although we are investigating optimizations of these algorithms, 
we maintain that significant complexity is required for detection to be com- 
plete. In the future, we plan to look at the kinds of information that may 
simplify the detection. If the property that is to be detected, $\Phi$, has certain
nice properties, then detection may be simplified. If the monitor has
some knowledge of how and when the application program potentially
changes the condition to be detected, then this can also simplify detection.
We have also been investigating casting the detection problem into temporal
and epistemological logics. We believe that such a characterization will aid
in finding sets of properties under which detection can be simplified. 

Although our original application was towards distributed application 
management, we have also been investigating the use of these detection 
protocols in the scope of debugging distributed systems [3]. The constraints
of a debugger are slightly different from those that arise in tool integration
or distributed application management. For example, in debugging, invasiveness is
traditionally considered intolerable, yet in tool integration, temporarily blocking
an application may be acceptable. 

The work most similar in spirit to ours is the set of protocols developed by
Spezialetti [16]. In particular, her event holding condition is the same
specification as our protocol for detecting currently, and the specification of
her event occurrence condition is similar to the specification of our possibly
$\Phi$ algorithm. However, her protocols for non-local event detection are
incomplete, in that they can miss conditions that in fact held. Figure 8
shows such an execution. If the messages in this
figure correspond to the messages generated in establishing simultaneous
regions [15], then her protocol will not detect $x_1 = x_2$, yet in fact definitely
$x_1 = x_2$ holds.
Figure 5: An execution and the corresponding lattice of global states.
Possibly($\Phi$): begin
    current := global state $(s_1^0, \ldots, s_n^0)$;
    lvl := 0;
    do no state in current satisfies $\Phi$ →
        last := current;
        lvl := lvl + 1;
        current := states of level lvl reachable from a state in last;
    od
end;

report Possibly $\Phi$


Figure 6: Algorithm for detecting Possibly $\Phi$.
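The level-by-level search of Figure 6 can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes a finite, fully recorded lattice supplied through a hypothetical `successors` function, and it returns `None` when the lattice is exhausted, whereas the monitor in the paper keeps waiting for further states. The `grid` helper and its sample values are invented for the example.

```python
from typing import Callable, Hashable, Iterable, Optional, Set

State = Hashable


def possibly(initial: State,
             successors: Callable[[State], Iterable[State]],
             phi: Callable[[State], bool]) -> Optional[int]:
    """Return the first level containing a state that satisfies phi,
    or None if the finite lattice is exhausted without finding one."""
    current: Set[State] = {initial}
    lvl = 0
    while current and not any(phi(s) for s in current):
        last = current
        lvl += 1
        # States of level lvl reachable from a state in last.
        current = {t for s in last for t in successors(s)}
    return lvl if current else None


def grid(x1, x2):
    """Hypothetical lattice: two processes that never communicate, so
    every pair of local states is consistent. A global state is a pair
    of indices into the value sequences x1 and x2."""
    def successors(state):
        i, j = state
        out = []
        if i + 1 < len(x1):
            out.append((i + 1, j))
        if j + 1 < len(x2):
            out.append((i, j + 1))
        return out

    def phi(state):
        i, j = state
        return x1[i] == x2[j]

    return successors, phi


succ, phi = grid([1, 3], [2, 3, 4])
level = possibly((0, 0), succ, phi)  # both variables equal 3 at level 2
```

Stopping as soon as one state in the current level satisfies the predicate is what lets the possibly protocol avoid enumerating an entire level.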


Definitely($\Phi$): begin
    last := global state $(s_1^0, \ldots, s_n^0)$;
    remove all states in last that satisfy $\Phi$;
    lvl := 1;
    % Invariant: last contains all states of level lvl - 1 that are accessible
    % from $(s_1^0, \ldots, s_n^0)$ without passing through a state satisfying $\Phi$
    do last ≠ { } →
        current := states of level lvl reachable from a state in last;
        remove all states in current that satisfy $\Phi$;
        lvl := lvl + 1; last := current
    od
end;

report Definitely $\Phi$



Figure 7: Algorithm for detecting Definitely $\Phi$.
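A corresponding sketch of Figure 7 is given below. It assumes a finite, fully recorded lattice in which a state with no successors is the final global state; under that assumption, reaching such a state without meeting the predicate shows that some linearization avoids it. The online monitor in the paper would instead wait for more states. The names `definitely`, `successors` and `grid` are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Set

State = Hashable


def definitely(initial: State,
               successors: Callable[[State], Iterable[State]],
               phi: Callable[[State], bool]) -> bool:
    """Return True if every path through the finite lattice passes
    through a state satisfying phi."""
    # last holds the states of the previous level that are reachable
    # from the initial state without passing through a phi state.
    last: Set[State] = set() if phi(initial) else {initial}
    while last:
        current: Set[State] = set()
        for s in last:
            nxt = list(successors(s))
            if not nxt:
                # Final state reached while avoiding phi (assumption:
                # the recorded lattice is complete).
                return False
            current.update(t for t in nxt if not phi(t))
        last = current
    return True


def grid(x1, x2):
    """Hypothetical lattice of two non-communicating processes."""
    def successors(state):
        i, j = state
        out = []
        if i + 1 < len(x1):
            out.append((i + 1, j))
        if j + 1 < len(x2):
            out.append((i, j + 1))
        return out

    def phi(state):
        i, j = state
        return x1[i] == x2[j]

    return successors, phi
```

With value sequences `[1, 3]` and `[2, 3, 4]` the result is `False`, because a path in which the second process runs to completion first never sees equal values; with `[1, 3]` and `[2, 3]` the result is `True`, because the final state itself satisfies the predicate.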
Figure 8: $\Phi = (x_1 = x_2)$